1 Introduction to My AFI Project
by Caleb J. Picker January 18, 2023, created using Quarto in RStudio
Sing the Sorrow is my favorite album of all time. I’ve listened to it 1000s of times over the last 20 years since its release on March 11, 2003. If you’re like me and you love AFI as much as I do, then you likely also love the rest of their discography. And if you’re really like me, then you also love data science, research, and statistics.
As the Sing the Sorrow 20th anniversary concert looms on the horizon, I was inspired to combine my two passions and analyze the content of AFI’s lyrics from their major album releases (and to sharpen my data science skills in natural language processing).
So let’s see what modern statistics can reveal about the meanings of AFI’s lyrics from their major album releases!
To preview, I present the most frequently used words overall and by album. Then, I perform a latent semantic analysis to help uncover how co-occurrences of words relate across their discography. It’s pretty cool what this can reveal, and I can’t wait to show you. This analysis involves a word cloud and a look at the hidden meanings of words and how they’re used in different contexts. I also select certain words that interest me and project them onto the semantic space generated by all the lyrics. For example, is the word “star” more closely related to “born” or “death”?
2 AFI’s Most Frequently Used Words
In this section, I calculate the most frequently used words and then I group them and show the top 10 most frequently used words by major album release.
To accomplish this, I used the Lyrics Genius API wrapper in the geniusr package to download AFI’s entire catalog of songs (following https://www.r-bloggers.com/2021/01/scraping-analysing-and-visualising-lyrics-in-r/).
2.1 Songs Removed from Analysis
Then I filtered to only those songs on their major album releases. This means I removed almost 100 songs from this analysis, as AFI’s catalog contains about 240 songs. See below for a list of removed songs.
For transparency, I next show the most frequent words with and without stop words.
2.2 Word Frequency
To begin pre-processing the lyrics, I first removed repeated lines (e.g., choruses), and then I removed stop words (e.g., “I”, “you”, “the”, “a”). This section shows word frequency both with stop words (Table 1, left) and without stop words (Table 2, right). (Note: Each table is interactive and searchable!)
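The pre-processing steps above can be sketched roughly as follows. The post itself was built in R, so this Python version is only an illustration and not the code behind the tables; the stop-word list, the function names, and the sample lines are all made up for the example.

```python
from collections import Counter
import re

# Tiny illustrative stop-word list (an assumption); a real analysis
# would use a much fuller list.
STOP_WORDS = {"i", "you", "the", "a", "to", "and", "of", "in", "my", "is"}

def tokenize(line):
    """Lowercase a line and split it into word tokens."""
    return re.findall(r"[a-z']+", line.lower())

def word_frequencies(lyrics, remove_stop_words=True):
    """Count words after dropping repeated lines (e.g., choruses)."""
    seen = set()
    counts = Counter()
    for line in lyrics.splitlines():
        key = line.strip().lower()
        if not key or key in seen:
            continue  # skip blank lines and lines already counted once
        seen.add(key)
        for word in tokenize(key):
            if remove_stop_words and word in STOP_WORDS:
                continue
            counts[word] += 1
    return counts

# Made-up placeholder lines, not real lyrics.
sample = """I feel the night
I feel the night
You feel the cold
"""
with_stops = word_frequencies(sample, remove_stop_words=False)
without_stops = word_frequencies(sample)
```

Here the repeated first line is counted only once, so “feel” is counted twice rather than three times, and the stop words disappear from the second table.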
As an aside, an analysis with stop words would definitely be interesting. Davey uses pronouns and direct objects in unique ways, and that could be an entire analysis of its own: The stop words could be contextualized, for example, using n-grams.
Without further ado, let’s reveal AFI’s most frequently used words! Looking at Table 2 (right), it seems like the top 6 words are feel (75), love (70), eyes (60), time (55), life (51), and heart (49).
Word frequency across AFI’s major album releases.
2.3 Top 10 Most Frequently Used Words (by Album)
In this section, I grouped the lyrics by major album release and sorted them (following https://drsimonj.svbtle.com/ordering-categories-within-ggplot2-facets). This allowed me to create a plot of AFI’s top 10 most frequently used words faceted by major album release. I tried my best to theme up the image and use Sing the Sorrow colors. I also tried to select colors from each album to represent the data as bars.
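As a rough illustration of the grouping step (not the R/ggplot2 code used for the plot), the per-album ranking could be sketched like this in Python. The function name `top_words_by_album` and the album and word values are placeholders invented for the example.

```python
from collections import Counter, defaultdict

def top_words_by_album(rows, n=10):
    """rows: iterable of (album, word) pairs, one pair per word occurrence."""
    counts = defaultdict(Counter)
    for album, word in rows:
        counts[album][word] += 1
    # most_common sorts by descending count; ties keep first-seen order.
    return {album: c.most_common(n) for album, c in counts.items()}

# Placeholder data, not real counts from the albums.
rows = [
    ("Album A", "grey"), ("Album A", "grey"), ("Album A", "heart"),
    ("Album B", "love"), ("Album B", "time"), ("Album B", "love"),
]
top = top_words_by_album(rows, n=1)
```

Each album gets its own ranked list, which is what lets every facet of the plot show its own top words in its own order.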
Following on the theme, let’s look at Sing the Sorrow (the blood-red album in the second row, second column). The top words are grey, tonight, dance, step, lay, inside, and heart. Grey likely comes from This Celluloid Dream (“All the colors (all grey) upon leaving (all grey) all will turn to grey.”).
Feel free to look at the rest of the albums! It’s pretty interesting! Let me know what else you discover!

3 Latent Semantic Analysis
In this section, I took a slightly different approach to calculating frequency. Whereas in the previous sections I calculated raw frequency, in this and the following sections I wanted to capture the reasons why Davey used the words he did. To start the process, therefore, I followed Gefen, Endicott, Fresneda, Miller, and Larsen’s (2017) process for a latent semantic analysis. Briefly, a latent semantic analysis examines similarities in word usage to discover how the same word, for example, can be used to mean different things. Does Davey use the word “star” to more closely resemble “born” or “death”?
3.1 Technical Details
If you’re not into technical details, feel free to skip this section and go straight to Cosine Similarity (Getting More Interesting!). Gefen et al.’s (2017) process allowed me to locally and globally weight the raw frequencies. The local weighting algorithm gives more weight to words that appear more often within a song (presumably because they’re more important); the global weighting algorithm gives less weight to words that appear more often across all songs (presumably because they’re less important, like the word “the”). These weightings were then multiplied together.
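To make the local-times-global idea concrete, here is a minimal Python sketch using log-entropy weighting, a pairing often used in latent semantic analysis. This is an assumption: the post does not say which weighting functions were actually applied, and the function name and counts below are invented for illustration.

```python
import math

def log_entropy_weights(counts):
    """counts: list of rows (one per word), each a list of per-song counts.

    Assumes at least two songs and that every word occurs at least once.
    """
    n_docs = len(counts[0])
    weighted = []
    for row in counts:
        total = sum(row)
        # Global weight: 1 + sum(p * log p) / log(n_docs). It is 1 for a
        # word confined to one song and 0 for a word spread evenly over
        # all songs.
        entropy = sum((c / total) * math.log(c / total) for c in row if c > 0)
        g = 1 + entropy / math.log(n_docs)
        # Local weight: log(1 + count) grows with in-song frequency,
        # then gets multiplied by the global weight.
        weighted.append([math.log(1 + c) * g for c in row])
    return weighted

# Rows are hypothetical words; columns are three hypothetical songs.
counts = [
    [2, 2, 2],  # spread evenly across songs, like "the"
    [3, 0, 0],  # concentrated in a single song
]
weights = log_entropy_weights(counts)
```

In this toy example, the evenly spread word ends up with weights of essentially zero everywhere, while the word concentrated in one song keeps its full local weight there.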
To start the process, I split up all the songs from the major album releases into separate text files. Then I counted the raw frequencies like before. Then I locally and globally weighted the raw frequencies. The rest of the analyses for this post rely on these weighted frequencies.
Next, I imported all the .txt files (one .txt file is one song) to make a collection of documents called a corpus. I set the minimum global frequency to 1. There are 1758 vocabulary words and 148 songs. After that, I created several matrices that have word loadings and document loadings (similar to factor analysis). However, for brevity, I will only present the results based on cosine similarity.
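The corpus step can be pictured as building a word-by-song count matrix. The Python sketch below is an illustration only, not the code used for this post; the function name, the file names, and the tokens are hypothetical, and the `min_global_freq` parameter simply mirrors the minimum global frequency mentioned above.

```python
from collections import Counter

def term_document_matrix(corpus, min_global_freq=1):
    """corpus: dict mapping a song file name to its list of word tokens."""
    songs = sorted(corpus)
    per_song = {s: Counter(corpus[s]) for s in songs}
    # Global frequency: total count of each word across all songs.
    global_freq = Counter()
    for c in per_song.values():
        global_freq.update(c)
    vocab = sorted(w for w, f in global_freq.items() if f >= min_global_freq)
    # One row per vocabulary word, one column per song.
    matrix = [[per_song[s][w] for s in songs] for w in vocab]
    return vocab, songs, matrix

# Hypothetical two-song corpus with made-up tokens.
corpus = {
    "song_a.txt": ["grey", "heart", "grey"],
    "song_b.txt": ["heart", "star"],
}
vocab, songs, matrix = term_document_matrix(corpus)
```

With a minimum global frequency of 1, every word that appears at all is kept, so the full matrix in this post would have 1758 rows and 148 columns.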
3.2 Cosine Similarity (Getting More Interesting!)
Similar to concepts like correlation, here I present results based on a similarity metric called cosine similarity. Cosine similarity ranges between -1 and +1:
closer to -1 means the words are more opposite,
closer to +1 means the words are more similar, and
closer to 0 means the words are unrelated.
Note that the diagonal contains values of 1, which means the comparison is between the word/song and itself (similar to a correlation table).
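For readers who want to see the metric itself, cosine similarity is the dot product of two vectors divided by the product of their lengths. The Python sketch below uses made-up two-dimensional vectors chosen to hit the three cases listed above; they are not values from the actual semantic space, and the function names are just for illustration.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_table(vectors):
    """Pairwise cosine similarity for a dict of name -> vector."""
    names = list(vectors)
    return {a: {b: cosine_similarity(vectors[a], vectors[b]) for b in names}
            for a in names}

# Made-up vectors, not loadings from the real analysis.
vectors = {
    "star": [1.0, 2.0],
    "born": [2.0, 4.0],     # same direction as "star"
    "death": [-1.0, -2.0],  # opposite direction
    "grey": [2.0, -1.0],    # perpendicular to "star"
}
table = similarity_table(vectors)
```

In this toy table, “star” vs. “born” comes out at about +1, “star” vs. “death” at about -1, “star” vs. “grey” at 0, and every word compared with itself sits at 1 on the diagonal.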
Table 3 shows cosine similarity among all words across all albums. Table 4 shows cosine similarity among all songs. I highly recommend you explore this to your own interests. I’ll analyze a small sample from each Table to show you the cool stuff you can glean!